[Models] add fleet model fallback by xiaoguoguo626807 · Pull Request #7732 · PaddlePaddle/FastDeploy

xiaoguoguo626807 · 2026-05-07T07:07:32Z

Motivation

Add PaddleFleet as a model inference backend (--model-impl paddlefleet). By replacing core_attention in the PaddleFleet TransformerLayer with the FastDeploy Attention kernel, the PaddleFleet model structure can reuse FastDeploy's KV Cache and high-performance attention computation.
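
The replacement idea can be sketched with plain Python stand-ins. This is only an illustration under stated assumptions: `ReferenceAttention`, `CachedAttention`, `ToySelfAttention`, `ToyLayer`, and `patch_core_attention` are hypothetical names, no PaddleFleet or FastDeploy classes are used, and the PR's actual `patch_paddlefleet_core_attention` may work differently.

```python
class ReferenceAttention:
    """Stand-in for the attention module that ships with the model layer."""

    def __call__(self, q, k, v):
        return "reference"


class CachedAttention:
    """Stand-in for an attention kernel that also records K/V per layer."""

    def __init__(self, layer_id, kv_cache):
        self.layer_id = layer_id
        self.kv_cache = kv_cache

    def __call__(self, q, k, v):
        # Append this step's key/value to the cache slot of this layer.
        self.kv_cache.setdefault(self.layer_id, []).append((k, v))
        return "cached"


class ToySelfAttention:
    def __init__(self):
        self.core_attention = ReferenceAttention()

    def __call__(self, q, k, v):
        return self.core_attention(q, k, v)


class ToyLayer:
    def __init__(self):
        self.self_attention = ToySelfAttention()


def patch_core_attention(layers, kv_cache):
    """Swap each layer's core_attention attribute; nothing else is touched."""
    for layer_id, layer in enumerate(layers):
        layer.self_attention.core_attention = CachedAttention(layer_id, kv_cache)
    return layers


layers = [ToyLayer() for _ in range(2)]
kv_cache = {}
patch_core_attention(layers, kv_cache)
outputs = [layer.self_attention("q", "k", "v") for layer in layers]
```

Because only the `core_attention` attribute changes, the rest of each layer keeps its original structure, which is the property the motivation relies on.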

Modifications

config.py: add paddlefleet to the ModelImpl type definition
engine/args_utils.py: support the --model-impl paddlefleet CLI argument and add validation logic
model_executor/models/paddleformers/base_fleet.py: add the PaddleFleetModelBase base class, the FastDeployAttention layer, and the patch_paddlefleet_core_attention replacement function
model_executor/models/paddleformers/__init__.py: register the PaddleFleetForCausalLM model class
test_fallback_fleet_model.py: requires separate PaddleFormers and PaddleFleet dependencies; a pytest conftest.py hook installs them dynamically at test time to avoid polluting the global environment
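
A conftest.py hook of the kind described in the last item might look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: the requirement strings are placeholders, `build_pip_command` and `install_isolated` are hypothetical helpers, and installing with `pip install --target` into a temporary directory is just one way to keep the global environment untouched.

```python
import subprocess
import sys
import tempfile

# Placeholder requirement strings; the PR's real package names and versions
# are not shown on this page.
ISOLATED_REQUIREMENTS = ["paddleformers", "paddlefleet"]


def build_pip_command(requirements, target_dir):
    """Return the pip invocation as an argument list (no string concatenation)."""
    return [sys.executable, "-m", "pip", "install", "--target", target_dir, *requirements]


def install_isolated(requirements):
    """Install into a throwaway directory and make it importable for this process only."""
    target_dir = tempfile.mkdtemp(prefix="isolated_deps_")
    subprocess.check_call(build_pip_command(requirements, target_dir))
    sys.path.insert(0, target_dir)
    return target_dir


def pytest_sessionstart(session):
    """pytest hook: when defined in conftest.py, runs once before collection."""
    install_isolated(ISOLATED_REQUIREMENTS)
```

Passing the command as a list also sidesteps the missing-space bugs that string-built pip commands are prone to.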

Usage or Command

python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --model-impl paddlefleet

Accuracy Tests

N/A (this PR adds the PaddleFleet inference backend; logits alignment data against a reference implementation is not yet available)

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-07T07:07:39Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-08T09:59:31Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-02 18:39:01

CI report generated from the following code (updated every 30 minutes):
PR commit: 820864a | Merge base: cb2d7c0 (branch: develop)

1 Required jobs: 8/10 passed

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
60(18)	42	37	5	0	0	0

Job	Error type	Confidence	Log
`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	Flaky issue: SHM race causes server startup failure	Medium	Job
`Approval`	Approval required	-	Job

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: flaky issue (confidence: medium)

Analyzer: ci_analyze_unittest_fastdeploy | Error type: flaky issue | Confidence: medium

Failed cases:

Case	Error summary
`pooling/test_Ernie4_5_reward_serving`	On the second server startup, the worker process cannot find the `cache_ready_signal.8458` shared memory

Key logs:

File "fastdeploy/worker/gpu_model_runner.py", line 281, in __init__
    self.cache_ready_signal = IPCSignal(
File "fastdeploy/inter_communicator/ipc_signal.py", line 110, in __init__
    self.shm = SharedMemory(name=name)
FileNotFoundError: [Errno 2] No such file or directory: '/cache_ready_signal.8458'
ERROR  api_server.py[line:146] Failed to initialize FastDeploy LLM engine, service exit now!

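Root cause summary: SHM name conflict when the second server (without prefix caching) starts. The test launches two servers in sequence (the first with --enable-prefix-caching, the second with --no-enable-prefix-caching), both on port 8458. After the first server is force-killed by clean_ports(), the cache_ready_signal.8458 shared memory is not fully cleaned up (the process is interrupted by SIGTERM before ipc_signal.clear() runs). When the second server starts, the SHM cleanup logic races, so the new SHM created by the engine cannot be found when the worker tries to open it.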
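
The failure mode in the log can be reproduced in isolation with the standard library: attaching to a named segment that has been unlinked raises `FileNotFoundError`. The sketch below assumes a POSIX system and is not FastDeploy code; `attach_with_retry` is a hypothetical helper shown only to make the timing dependence visible, not something the CI report recommends.

```python
import time
from multiprocessing import shared_memory


def attach_with_retry(name, timeout=2.0, interval=0.05):
    """Attach to an existing segment, retrying until it appears or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return shared_memory.SharedMemory(name=name)
        except FileNotFoundError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


# The "engine" side creates the segment.
owner = shared_memory.SharedMemory(create=True, size=16)
name = owner.name

# A "worker" attaching while the segment exists succeeds.
reader = attach_with_retry(name)
attached_ok = reader.size >= 16
reader.close()

# Once the segment is unlinked, a later attach by the same name fails,
# which is the FileNotFoundError seen in the key logs.
owner.close()
owner.unlink()
try:
    attach_with_retry(name, timeout=0.2)
    missing_after_unlink = False
except FileNotFoundError:
    missing_after_unlink = True
```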

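Suggested fix:

Known flaky issue (SHM race); rerun first to confirm whether it is intermittent
If it keeps failing, check the server switching logic in tests/pooling/test_Ernie4_5_reward_serving.py: the server_default_caching fixture lacks yield + teardown, so it does not wait for the first server to exit completely (including SHM cleanup) before switching
Change the fixture to yield style, and in teardown wait for os.killpg + sleep(2) before starting the next server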
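
The yield-style teardown suggested above could be sketched as a context manager that a pytest fixture would wrap. This is an assumption-based sketch for POSIX systems: `managed_server` is a hypothetical name, and it waits for the process to actually exit instead of using the fixed `sleep(2)` mentioned in the suggestion. It does not clean up shared memory itself.

```python
import contextlib
import os
import signal
import subprocess
import sys


@contextlib.contextmanager
def managed_server(cmd, grace_seconds=10.0):
    """Start cmd in its own process group and guarantee it has exited on teardown."""
    # start_new_session=True makes the child a group leader, so its pid is also its pgid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        yield proc
    finally:
        # Signal the whole group so child worker processes are stopped too.
        with contextlib.suppress(ProcessLookupError):
            os.killpg(proc.pid, signal.SIGTERM)
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            # Escalate if the group ignored SIGTERM within the grace period.
            with contextlib.suppress(ProcessLookupError):
                os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()


# Usage: the next server is only started after the with-block has returned,
# that is, after the previous process has been reaped.
with managed_server([sys.executable, "-c", "import time; time.sleep(60)"]) as server:
    was_running = server.poll() is None
has_exited = server.poll() is not None
```

In a fixture, the body of the `with` block would be the `yield` that hands the running server to the test.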

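Related changes: the PR does not modify gpu_model_runner.py / ipc_signal.py / common_engine.py, so there is no direct code link to this failure

🔴 Approval: approval required

This job requires manual approval; CI continues only after it is approved. Please complete the manual approval.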

codecov-commenter · 2026-05-08T11:49:57Z

Codecov Report

❌ Patch coverage is 32.92308% with 218 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@cb2d7c0). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
.../model_executor/models/paddleformers/base_fleet.py	29.80%	207 Missing and 5 partials ⚠️
fastdeploy/model_executor/utils.py	0.00%	4 Missing ⚠️
fastdeploy/model_executor/models/model_base.py	60.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7732   +/-   ##
==========================================
  Coverage           ?   67.72%           
==========================================
  Files              ?      468           
  Lines              ?    65509           
  Branches           ?    10067           
==========================================
  Hits               ?    44365           
  Misses             ?    18299           
  Partials           ?     2845

Flag	Coverage Δ
GPU	`77.93% <32.92%> (?)`
XPU	`7.04% <0.30%> (?)`

xiaoguoguo626807 · 2026-05-28T10:49:21Z

/re-run all-failed

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-01 14:34:00

📋 Review Summary

PR overview: adds PaddleFleet as a model inference backend (--model-impl paddlefleet), reusing the KV Cache by replacing core_attention in the PaddleFleet TransformerLayer with FastDeploy Attention.
Scope of changes: model_executor/models/, config.py, engine/args_utils.py, graph_optimization/decorator.py, scripts/, tests/
Impact tags: [Models] [FDConfig] [Engine] [Graph Optimization] [CI]

Issues

Severity	File	Summary
🟡 Suggestion	`fastdeploy/model_executor/graph_optimization/decorator.py:68`	`graph_opt_backend` does not accept positional arguments; forwarding `*args` will crash when graph optimization is enabled

Status of previous findings

Finding	Issue	Status
F1	pip command string concatenation is missing a space	✅ Fixed
F2	`params_dtype` hardcoded to bfloat16	✅ Fixed (now uses `self.model_config.dtype or "bfloat16"`)
F3	Same as F1	✅ Fixed
F4	Implicitly concatenated help text is missing a space	⚠️ Still present
F5	`PretrainedModel` import changed to an internal path	⚠️ Still present
F6	Same as F1	✅ Fixed
F7	`layer_number` is 1-indexed vs. Attention is 0-indexed	⚠️ Still present
F8	`load_weights` lacks logging	✅ Fixed
F9	References a nonexistent test file	✅ Fixed
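
Finding F4 refers to a general Python pitfall: adjacent string literals are joined with no separator. A small illustration follows; the help text is made up for the example and is not the PR's actual wording.

```python
# Adjacent string literals are concatenated as-is, so a missing trailing
# space silently glues the last word of one line to the first of the next.
help_missing_space = (
    "Model implementation to use."
    "Supported values include paddlefleet."
)

# Ending each fragment with a space (or starting the next with one) fixes it.
help_fixed = (
    "Model implementation to use. "
    "Supported values include paddlefleet."
)
```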

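📝 PR convention check

The PR description structure is compliant, but the Checklist states are inconsistent:

[ ] Add unit tests: the PR adds tests/model_executor_fallback/test_fallback_fleet_model.py, so this should be [x]
[x] Provide accuracy results: the Accuracy Tests section says N/A, so this should be [ ] (noting the reason in parentheses is enough)

Suggested title (copy-ready):

[Models] Add PaddleFleet model fallback backend

Suggested PR description (copy-ready)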

## Motivation
Add PaddleFleet as a model inference backend (`--model-impl paddlefleet`). By replacing `core_attention` in the PaddleFleet TransformerLayer with the FastDeploy Attention kernel, the PaddleFleet model structure can reuse FastDeploy's KV Cache and high-performance attention computation.

## Modifications
- `config.py`: add `paddlefleet` to the `ModelImpl` type definition
- `engine/args_utils.py`: support the `--model-impl paddlefleet` CLI argument and add validation logic
- `worker/worker_process.py`: update the `--model-impl` choices accordingly
- `model_executor/models/paddleformers/base_fleet.py`: add the `PaddleFleetModelBase` base class, the `FastDeployAttention` layer, and the `patch_paddlefleet_core_attention` replacement function
- `model_executor/models/paddleformers/__init__.py`: register the `PaddleFleetForCausalLM` model class
- `model_executor/graph_optimization/decorator.py`: make `__call__` accept positional arguments (`*args`)
- `scripts/coverage_run.sh`: add an `isolated` test category so that fleet-related tests run last
- `tests/model_executor_fallback/`: add `conftest.py` and `test_fallback_fleet_model.py`

## Usage or Command
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --model-impl paddlefleet
```

## Accuracy Tests
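N/A (this PR adds the PaddleFleet inference backend; logits alignment data against a reference implementation is not yet available and will be added in a follow-up PR)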

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

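Overall assessment

The overall design is reasonable: monkey-patching core_attention lets PaddleFleet models reuse the KV Cache. Of the 9 previous findings, 6 are marked fixed in the table above. The recommended priorities are the two remaining issues F4 (missing space in help text) and F7 (layer_id offset), plus the decorator *args compatibility issue found in this round.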
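
The F7 offset can be illustrated with a toy per-layer cache list. This is a sketch under assumptions: `kv_caches` and `cache_for_layer_number` are hypothetical, and the real code may resolve the mismatch in a different place.

```python
NUM_LAYERS = 3

# One cache slot per layer, indexed from 0 as the attention side expects.
kv_caches = [f"kv_cache_layer_{i}" for i in range(NUM_LAYERS)]


def cache_for_layer_number(layer_number):
    """Map a 1-indexed layer_number to its 0-indexed cache slot."""
    if not 1 <= layer_number <= NUM_LAYERS:
        raise ValueError(f"layer_number must be in 1..{NUM_LAYERS}, got {layer_number}")
    return kv_caches[layer_number - 1]


first = cache_for_layer_number(1)
last = cache_for_layer_number(NUM_LAYERS)

# Using the 1-indexed number directly is off by one for every layer ...
wrong_first = kv_caches[1]

# ... and runs past the end of the list for the final layer.
try:
    kv_caches[NUM_LAYERS]
    overflowed = False
except IndexError:
    overflowed = True
```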

PaddlePaddle-bot · 2026-06-01T06:38:10Z

+            return self.forward(*args, **kwargs)

-        return self.graph_opt_backend(**kwargs)
+        return self.graph_opt_backend(*args, **kwargs)


🟡 Suggestion: graph_opt_backend.__call__ accepts only **kwargs (GraphOptBackend.__call__(self, **kwargs)), so forwarding *args here raises TypeError when use_graph_opt=True and positional arguments are passed.

The PaddleFleet model already applies the @support_graph_optimization decorator, so this path is triggered if the user enables graph optimization.

Suggested fix:

def __call__(self, *args, **kwargs):
    """Decorator model.__call__() func"""
    if not self.use_graph_opt:
        return self.forward(*args, **kwargs)
    # graph_opt_backend supports kwargs only
    return self.graph_opt_backend(**kwargs)

Alternatively, add *args support to GraphOptBackend.__call__ as well.
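
The behaviour described in this comment can be reproduced with stand-in classes. In the sketch below, `KwargsOnlyBackend` and `ToyModel` are hypothetical and no FastDeploy code is used. Note that the kwargs-only fix suggested above would silently drop any positional arguments; the `call_binding_args` variant, which names positionals via `inspect.signature` before forwarding, is an additional option assumed here and not something the review proposes.

```python
import inspect


class KwargsOnlyBackend:
    """Stand-in for a backend whose __call__ accepts keyword arguments only."""

    def __call__(self, **kwargs):
        return kwargs


class ToyModel:
    def __init__(self, use_graph_opt):
        self.use_graph_opt = use_graph_opt
        self.graph_opt_backend = KwargsOnlyBackend()

    def forward(self, input_ids, forward_meta=None):
        return {"input_ids": input_ids, "forward_meta": forward_meta}

    def call_forwarding_args(self, *args, **kwargs):
        """Mirrors the diff: positional arguments are passed straight through."""
        if not self.use_graph_opt:
            return self.forward(*args, **kwargs)
        return self.graph_opt_backend(*args, **kwargs)

    def call_binding_args(self, *args, **kwargs):
        """Alternative: turn positionals into keywords using forward's signature."""
        if not self.use_graph_opt:
            return self.forward(*args, **kwargs)
        bound = inspect.signature(self.forward).bind(*args, **kwargs)
        return self.graph_opt_backend(**bound.arguments)


model = ToyModel(use_graph_opt=True)

# Forwarding a positional argument to a kwargs-only callable fails.
try:
    model.call_forwarding_args([1, 2, 3], forward_meta="meta")
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Binding first keeps the positional value and passes it by name.
bound_result = model.call_binding_args([1, 2, 3], forward_meta="meta")
```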

add fleet fallback

6898863

xiaoguoguo626807 had a problem deploying to Metax_ci May 7, 2026 07:07 — with GitHub Actions Failure

remove fleet depend

18cc86b

xiaoguoguo626807 had a problem deploying to Metax_ci May 8, 2026 09:34 — with GitHub Actions Failure

change import juage

5e81aaf

xiaoguoguo626807 had a problem deploying to Metax_ci May 8, 2026 09:57 — with GitHub Actions Failure